How Crop Type and Climate Zone Influence Agricultural Water Use in California
MEDS
R
Published
December 4, 2025
About
Dataset Description
California produces a large portion of the nation’s fruits, nuts, and vegetables. Irrigated agriculture dominates water use in the state, and the amount of water applied varies by crop type, hydrologic region, and year.1
This dataset provides annual estimates of:
Applied Water per Acre
Hydrologic Regions
20 Crop Categories
Years 2016 - 2020
In this blog post, we would like to dive in to understand how climate region, crop type, and year would influence our Gamma response variable, applied water per acre.
# Clear the environmentrm(list =ls())# Load in librariespacman::p_load(tidyverse, dplyr, here, readxl, pals, broom, knitr, janitor, dagitty, ggdag, ggplot2, kableExtra)
The downloaded binary packages are in
/var/folders/rq/nz4_dfz56932cn8khybs1jm80000gn/T//Rtmp3Yqs8h/downloaded_packages
The downloaded binary packages are in
/var/folders/rq/nz4_dfz56932cn8khybs1jm80000gn/T//Rtmp3Yqs8h/downloaded_packages
Load and Explore Data
In this section, we load the raw Excel file and prepare it for analysis. Real datasets often contain unnecessary columns, inconsistent variable names, or values that must be converted into usable formats. We use functions like clean_names(), select(), and pivot_longer() to transform the data into a tidy structure, where each row corresponds to a single crop-year-region combination. Finally, we filter out missing or zero values so the Gamma model, which requires positive continuous outcomes, can be correctly fit.
# Read in data and clean the name to snake case and drop unused columnswater_raw <-read_excel(here("data","agricultural_water_use_data_2016_2020.xlsx"),sheet ="Statewide_AW_Unit", skip =1) %>%clean_names() %>%select(!starts_with("x"))# Pivot all crop columns, convert variable types, and drop missing rowswater_long <- water_raw %>%pivot_longer(cols =-c(year, hr),names_to ="crop",values_to ="aw_per_acre") %>%mutate(aw_per_acre =as.numeric(aw_per_acre),year =as.integer(year),hr =as.factor(hr),crop =as.factor(crop)) %>%filter(!is.na(aw_per_acre), aw_per_acre >0)
Exploratory Visualization
Before fitting any statistical model, it is essential to understand the distribution of the variables. The violin plot visualizes how applied water per acre varies across crop types and reveals skewness, outliers, and general patterns. This helps justify the modeling choices and provides intuition about which predictors might influence water use.
# Create a consistent color palette for crop typescrop_cols <-setNames(alphabet()[1:length(levels(water_long$crop))], levels(water_long$crop))water_long %>%ggplot(aes(x = crop, y = aw_per_acre, fill = crop)) +geom_violin(trim =FALSE, alpha =0.7) +geom_boxplot(width =0.1, outlier.shape =NA, alpha =0.5) +scale_fill_manual(values = crop_cols) +coord_flip() +theme_bw() +theme(legend.position ="none",plot.title =element_text(size =14, face ="bold", hjust =0),axis.title.y =element_blank()) +labs(y ="Applied Water per Acre")
Distribution of applied water per acre across crop types, using violin + boxplot to visualize skewness and variability.
DAG (Directed Acyclic Graph)
A Directed Acyclic Graph clarifies our assumptions about how variables influence each other. Let’s dive into our variables:
Crop Types influence water use because different crops require different irrigation levels
The hydrologic region affects crop type because different regions have different crops
Both year and hydrologic influence applied water per acre due to climate and resource availability
These relationships reflect scientific justification, including crop, hydrological region, and predictors in the regression model.
Code
# Draw a DAGwater_dag <-dagitty("dag { hr -> crop hr -> aw_per_acre year -> crop year -> aw_per_acre crop -> aw_per_acre}")# Assign coordinate for each variable nodescoordinates(water_dag) <-list(x =c(hr =0, year =2, crop =1, aw_per_acre =3),y =c(hr =2, year =2, crop =1, aw_per_acre =1))# Assign color for the nodesnode_colors <-c(hr ="#7F95D1", year ="#FF82A9", crop ="#FA824C", aw_per_acre ="#1B998B")# Plot DAGggdag(water_dag, text =FALSE, node =FALSE) +geom_dag_node(aes(fill = name), size =12, shape =21, color ="white") +geom_dag_edges() +scale_fill_manual(values = node_colors, name ="Variables",labels =c(aw_per_acre ="Applied Water per Acre",crop ="Crop Type",hr ="Hydrologic Region",year ="Year")) +theme_minimal() +theme(legend.title =element_text(size =15, face ="bold", hjust =0.5),legend.position ="right",legend.text =element_text(size =12, face ="bold"),axis.title =element_blank(),axis.text =element_blank(),axis.ticks =element_blank(),panel.grid =element_blank(),panel.background =element_blank(),plot.margin =margin(10, 10, 10, 10))
DAG representing hypothesized causal relationships among crop type, hydrologic region, year, and applied water per acre.
The DAG summarizes our assumptions about how crop type, hydrologic region, and year influence applied water per acre. Hydrologic region influences crop type because certain crops grow better in specific climates and soils. Both hydrologic region and year influences the response variable due to variation in precipitation, groundwater availability, and climate condition. Year affects crop distribution annually as they response directly to the market and climate.
Statistical Modeling
Why Gamma?
The Gamma Distribution with a log link works best with this data because the applied water per acre is positive and continuous. A positive continuous variable is a numerical value that can take any real number that is greater than zero.
The distribution is right-skewed, meaning most values are small, and a few very extreme values stretch the tail to the right, causing the data to cluster near the lower end and occasional extreme values that create an asymmetrical shape with a long right tail. Also, because it can only be positive.
A multiplicative effect shows scientific results that change the outcome in proportion to the change. Rather than adding or subtracting a fixed amount, we can view the numbers as a percentage difference or twice as large. Multiplying the response is more meaningful than adding a constant amount, which makes the Gamma model fit right in.
Gamma Model in Statistical Notation
The mathematical form of the model explains this notation:
\[
{Applied~Water} \sim Gamma(\mu, \phi) \\\\
log(\mu) = \beta_0 + \beta_1 * (Crop) + \beta_2 * (Hydrologic~Region) + \beta_3 *(Year)
\] The linear predictor for the mean is on the log scale, and the parameter of Gamma consists of the following:
\(\mu\) is expected applied water per acre
\(\phi\) is shape parameter and it controls with its size.
Larger \(\phi\) - distribution becomes more symmetric
Smaller \(\phi\) - distribution becomes more right-skewed and more spread out
Fit Model to Simulated Data
To demonstrate how the Gamma model works under the known parameters, we simulated 200 observations from a Gamma distribution with a log-linear mean structure. We set the \(\beta_0 = 1\) and a slope \(\beta_1 = -0.4\).
Gamma Regression Coefficients from Simulated Data (with 95% CI)
Term
Estimate
Std. Error
Statistic
p-value
CI Lower
CI Upper
(Intercept)
1.039
0.063
16.485
0
0.913
1.167
predictor
-0.435
0.028
-15.321
0
-0.491
-0.376
Parameter Recovery for Simulated Data
During the simulation, the true parameters were:
Intercepts: \(\beta_0 = 1\)
Slope: \(\beta_1 = -0.4\)
The fitted Gamma regression model estimated coefficients very close to the true values we selected. The intercept estimate is approximately 1, and the slope estimate is approximately -0.4, indicating that the Gamma model with a log link successfully recovers the underlying parameters when the model assumptions hold. The simulation validates that the Gamma model applies to the water data.
# Plot the simulated dataggplot(sim_data, aes(predictor, response)) +geom_point(alpha =0.75) +labs(x ="Predictor", y ="Response") +theme(plot.title =element_text(size =14, face ="bold", hjust =0.5))
Simulated Gamma-distributed response data with a log-linear relationship.
ggplot(sim_data, aes(x = predictor, y = response)) +geom_point(alpha =0.5) +geom_smooth(method ="glm",method.args =list(family =Gamma(link ="log")),se =TRUE, color ="red", fill ="lightblue", size =1.1) +theme_bw() +labs(x ="Predictor", y ="Response")
Simulated Data with Fitted Gamma Regression Line
Fit to Real Life Data
Code
# Fit Gamma Regression using log link to model applied water per acrewater_gamma_model <-glm(aw_per_acre ~ crop + hr + year,family =Gamma(link ="log"),data = water_long)# Make the table presentablewater_results <-tidy(water_gamma_model, conf.int =TRUE) %>%mutate(term =gsub("crop", "Crop: ", term),term =gsub("hr", "Region: ", term)) %>%kable(format ="html",caption ="Gamma Regression Coefficients with 95% Confidence Intervals",col.names =c("Term", "Estimate", "Std. Error", "Statistic", "p-value","CI (Lower)", "CI (Upper)"), digits =3) %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"),full_width =FALSE, position ="center") %>%row_spec(0, bold =TRUE, background ="#f0f0f0")# Show tablewater_results
Gamma Regression Coefficients with 95% Confidence Intervals
Hydrologic Region Effects with fitted Gamma regression line and 95% confidence interval
Inference
We are interested in if hydrologic region is associated with differences in applied water per acre, after adding all consideration for crop types and year.
Explain Hypothesis in Plain-Language
Some hydrologic regions require more applied water per acre than others, even after adjusting for crop type and year.
Hypothesis
\[
H_0: \beta_r = 0 \quad \text{(no difference in expected water use)} \\
H_A: \beta_r \neq 0 \quad \text{(a region differs from the baseline region)}
\]
Visualizing Hypothesis
# Plot water_long %>%ggplot(aes(x = hr, y = aw_per_acre)) +geom_boxplot(fill ="#7F95D1") +labs(title ="Applied Water per Acre by Hydrologic Region",x ="Hydrologic Region",y ="Applied Water per Acre") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1,size =10))
Distribution of applied water per acre across hydrologic regions.
Conclusion
The box plot of applied water per acre by hydrologic region shows that the Colorado River and South Lahontan regions have the highest median water use. In contrast, regions such as San Francisco and the Central Coast show noticeably low levels. These figures align with the coefficient plot from the Gamma regression model. The hydrologic region is an essential predictor of applied water per acre, even without accounting for other factors.
Reference
California Department of Food and Agriculture. (2024). California agricultural production statistics [web page]. Available: https://www.cdfa.ca.gov/statistics. [Accessed: Dec. 4, 2025]
California Department of Water Resources. (2020). Statewide Agricultural Water Use Data, 2016–2020 [data file]. Available: https://data.ca.gov/dataset/statewide-agricultural-water-use-data-2016-2020. [Accessed: Dec. 4, 2025]
California Department of Food and Agriculture. (2024). California agricultural production statistics [web page]. Available: https://www.cdfa.ca.gov/statistics. [Accessed: Dec. 4, 2025]↩︎
Citation
BibTeX citation:
@online{2025,
author = {},
title = {How {Crop} {Type} and {Climate} {Zone} {Influence}
{Agricultural} {Water} {Use} in {California}},
date = {2025-12-04},
url = {https://jwonyk.github.io/posts/crop-water-gamma/},
langid = {en}
}